Introduction

This study is a partial replication of Bradlow, Bassard, and Paller’s (2023) investigation of perceptual adaptation to second-language (L2) speech. The original study examined the conditions that facilitate or constrain perceptual adaptation to L2 speech, with a focus on how variability in training exposure affects generalization of this adaptation. The authors tested whether high-variability (i.e., multiple-talker) training is necessary for cross-talker generalization or whether low-variability (i.e., single-talker) training can be sufficient for such generalization.

This replication focuses specifically on the training phase of the original study. We aim to test the prediction that some single-talker training conditions can yield significant improvement in L2 speech recognition, which would suggest that high-variability exposure is not necessary for perceptual adaptation. We also examine whether training-phase intelligibility plays a crucial role in perceptual adaptation to L2 speech, as suggested by the original study’s post-hoc analysis.

Methods

Power Analysis

The original study found various effect sizes across different training conditions. For single-talker training conditions, improvements relative to the untrained control condition ranged from 10.2% (FAR training with BRP test talker, p < 0.003) to non-significant negative effects (-4.6% for TUR training with FAR test talker). Given our use of 30 trials compared to the original study’s 60 trials, we adjusted expected effect sizes downward by approximately 50% to account for reduced measurement precision.

# Define effect sizes (adjusted for 30 vs 60 trials)
effect_sizes <- tibble(
  effect_type = c("Small", "Medium", "Large", "Bradlow-Low", "Bradlow-High"),
  original_d = c(0.30, 0.50, 0.80, 0.60, 1.00),
  adjusted_d = c(0.15, 0.25, 0.40, 0.30, 0.50)
)

# Simulated power results for different sample sizes
power_results <- tibble(
  n_per_condition = c(100, 150, 200, 250, 300, 350, 400, 450, 500),
  Small = c(0.189, 0.252, 0.325, 0.398, 0.431, 0.521, 0.555, 0.604, 0.671),
  Medium = c(0.431, 0.584, 0.702, 0.803, 0.866, 0.903, 0.938, 0.959, 0.976),
  Large = c(0.802, 0.932, 0.987, 0.993, 0.997, 0.999, 1.000, 1.000, 1.000),
  `Bradlow-Low` = c(0.548, 0.747, 0.842, 0.921, 0.950, 0.981, 0.988, 0.994, 0.999),
  `Bradlow-High` = c(0.935, 0.991, 0.999, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000)
)

# Display power for n=200 (our target)
target_power <- power_results %>%
  filter(n_per_condition == 200) %>%
  pivot_longer(cols = -n_per_condition, names_to = "Effect", values_to = "Power") %>%
  left_join(effect_sizes %>% select(effect_type, adjusted_d), 
            by = c("Effect" = "effect_type")) %>%
  mutate(Power_pct = sprintf("%.1f%%", Power * 100))

format_table(target_power %>% select(Effect, adjusted_d, Power_pct),
      col.names = c("Effect Type", "Cohen's d", "Power"),
      caption = "Statistical power with n=200 per condition")
Statistical power with n=200 per condition

Effect Type    Cohen’s d  Power
Small          0.15       32.5%
Medium         0.25       70.2%
Large          0.40       98.7%
Bradlow-Low    0.30       84.2%
Bradlow-High   0.50       99.9%
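As a cross-check (not part of the original simulation), the tabulated values can be compared against closed-form power for a two-sample t-test; `power.t.test` should land close to the simulated estimates at n = 200 per group.

```r
# Analytic power for a two-sample t-test at n = 200 per group, using the
# same effect sizes as the simulation above (d used directly as delta
# with sd = 1). Values should fall within simulation error of the table.
ds <- c(Small = 0.15, Medium = 0.25, `Bradlow-Low` = 0.30,
        Large = 0.40, `Bradlow-High` = 0.50)
analytic_power <- sapply(ds, function(d) {
  power.t.test(n = 200, delta = d, sd = 1, sig.level = 0.05,
               type = "two.sample", alternative = "two.sided")$power
})
round(analytic_power, 3)
```

For example, analytic power for d = 0.25 comes out at roughly 0.70, in line with the simulated 70.2%.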
# Create power curve plot
power_long <- power_results %>%
  pivot_longer(cols = -n_per_condition, names_to = "Effect", values_to = "Power")

ggplot(power_long, aes(x = n_per_condition, y = Power, color = Effect)) +
  geom_line(linewidth = 1.2) +
  geom_point(size = 2) +
  geom_hline(yintercept = c(0.8, 0.9), linetype = "dashed", alpha = 0.5) +
  geom_vline(xintercept = 200, linetype = "dotted", color = "red", linewidth = 1) +
  scale_y_continuous(breaks = seq(0, 1, 0.1), limits = c(0, 1)) +
  scale_color_viridis_d() +
  labs(x = "Sample Size per Condition", 
       y = "Statistical Power",
       title = "Power Analysis for Perceptual Adaptation Effects",
       subtitle = "Red line indicates planned sample size (n=200 per condition)") +
  theme_minimal() +
  theme(legend.position = "bottom") +
  annotate("text", x = 210, y = 0.05, label = "n=200", 
           color = "red", hjust = 0, fontface = "bold")

# Summary statement
cat("\nPower Analysis Summary:
With n=200 participants per condition, we have:
- ", sprintf("%.0f%%", target_power$Power[target_power$Effect == "Small"] * 100), " power to detect small effects (d=0.15)
- ", sprintf("%.0f%%", target_power$Power[target_power$Effect == "Medium"] * 100), " power to detect medium effects (d=0.25)
- ", sprintf("%.0f%%", target_power$Power[target_power$Effect == "Large"] * 100), " power to detect large effects (d=0.40)
- ", sprintf("%.0f%%", target_power$Power[target_power$Effect == "Bradlow-Low"] * 100), " power to detect Bradlow-equivalent low effects (d=0.30)
- ", sprintf("%.0f%%", target_power$Power[target_power$Effect == "Bradlow-High"] * 100), " power to detect Bradlow-equivalent high effects (d=0.50)

Note: For generalization conditions (testing only 25% of trials), effect sizes and power are approximately halved.\n", sep="")
## 
## Power Analysis Summary:
## With n=200 participants per condition, we have:
## - 32% power to detect small effects (d=0.15)
## - 70% power to detect medium effects (d=0.25)
## - 99% power to detect large effects (d=0.40)
## - 84% power to detect Bradlow-equivalent low effects (d=0.30)
## - 100% power to detect Bradlow-equivalent high effects (d=0.50)
## 
## Note: For generalization conditions (testing only 25% of trials), effect sizes and power are approximately halved.

Planned Sample

Based on our power analysis, we planned to recruit approximately 1,200 L1 English speakers (200 per condition × 6 conditions) for this study. This sample size provides 84% power to detect effects of d=0.30, equivalent to the lower range of effects found in Bradlow et al. (2023).

Actual recruitment through Prolific yielded 1,370 complete submissions (no early timeouts or attention check failures during the experiment). However, after applying our preregistered exclusion criteria during data analysis, 917 valid participants remained. Of these, 834 (90.9%) were native English speakers who form the primary analysis sample, with 83 non-native English speakers analyzed separately for comparison purposes. The native speaker sample provides approximately 139 participants per condition, which still ensures:

  - 72% power for Bradlow-equivalent low effects (d=0.30)
  - 54% power for medium effects (d=0.25)
  - 91% power for large effects (d=0.40)

Participants were between 18 and 35 years of age, from the US, UK, and Canada, with no self-reported deficits in speech, language, or hearing, and with normal or corrected-to-normal vision. All participants confirmed they were using headphones or earbuds and Google Chrome browser. Participants were compensated at an average rate of $10.23/hour, with a median completion time of 8 minutes and 13 seconds.

Materials

While the original study used materials from the ALLSSTAR Corpus, our replication used sentence recordings from the L2-ARCTIC corpus due to its accessibility and comprehensive documentation. The L2-ARCTIC corpus includes recordings from 24 non-native speakers of English from six L1 backgrounds (Hindi, Korean, Mandarin, Spanish, Arabic, and Vietnamese), with each L1 represented by two male and two female speakers.

For our study, we selected 15 L2 talkers from six different L1 backgrounds, ensuring balanced representation by speaker gender. These talkers were selected based on moderate-to-good comprehensibility and distinct L2-accented speech, similar to the criteria used in the original study.

We compiled a set of 30 unique sentences from the corpus that met the following criteria:

  - Duration between 2.0 and 4.45 seconds
  - No proper nouns (to avoid spelling confusion)
  - Recorded by all selected speakers
  - Free from recording artifacts or quality issues
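The selection step can be sketched as a simple metadata filter. The data frame and column names below are illustrative stand-ins, not actual L2-ARCTIC metadata fields, and the recording-quality screen is assumed to have been done by hand.

```r
# Toy metadata table; columns are hypothetical, for illustration only
corpus_meta <- data.frame(
  sentence_id     = c("a0001", "a0002", "a0003", "a0004"),
  duration_s      = c(1.8, 2.5, 3.9, 4.6),
  has_proper_noun = c(FALSE, FALSE, TRUE, FALSE),
  n_talkers       = c(15, 15, 15, 15)
)

# Apply the duration, proper-noun, and coverage criteria described above
eligible <- subset(
  corpus_meta,
  duration_s >= 2.0 & duration_s <= 4.45 & !has_proper_noun & n_talkers == 15
)
eligible$sentence_id  # only "a0002" passes all criteria
```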

The sentences were presented as audio only (no visual text) and were not mixed with noise, differing from the original study’s use of speech-shaped noise at 0 dB SNR. This change was made to reduce cognitive load given the addition of time constraints in our paradigm.

Procedure

Participants listened to sentence recordings over headphones or earbuds. The sentences were presented one at a time with no possibility of repetition. Participants typed what they heard using the computer keyboard and could begin typing while the audio was playing. However, they could not advance to the next trial until the audio finished playing. After audio completion, participants had 15 seconds to finish typing their response before automatic progression to the next trial.

All responses were automatically formatted to lowercase and punctuation was removed (except apostrophes) to reduce orthographic variability and focus on speech perception accuracy. No feedback was provided during the experiment.
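A minimal sketch of this normalization follows; the actual formatting ran client-side in the experiment, so this R version is an assumed equivalent, not the deployed code.

```r
# Lowercase the response, strip punctuation except apostrophes,
# then collapse runs of whitespace
normalize_response <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z' ]", "", x)
  gsub(" +", " ", trimws(x))
}

normalize_response("It's RAINING, isn't it?")  # "it's raining isn't it"
```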

Participants were randomly assigned to one of six experimental conditions:

  1. Single-single-same: Same speaker throughout all 30 trials
  2. Single-single-diff-same-variety: One speaker for training (trials 1-15), different speaker of same L1 background for testing (trials 16-30)
  3. Single-single-diff-diff-variety: One speaker for training, different speaker of different L1 background for testing
  4. Single-multi-excl-single: Single speaker for training, multiple speakers (excluding training speaker) for testing
  5. Multi-multi-all-random: Random speaker selection for each trial throughout
  6. Multi-excl-single-single: Multiple speakers (excluding one) for training, the excluded speaker for testing

Two attention check trials were inserted at trials 16 and 31, requiring participants to type a specific word from a clearly articulated sentence. Participants who failed both attention checks or had two “strikes” (timeouts or failed attention checks) before trial 16 were excluded from analysis. Note that the main experiment consisted of 30 content trials plus these 2 attention checks for a total of 32 trials.
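The exclusion rule can be expressed as a small predicate; the argument names here are hypothetical, not the experiment’s actual variable names.

```r
# TRUE if the participant should be excluded under the rule above:
# failing both attention checks, or accruing two strikes before trial 16
exclude_participant <- function(failed_both_checks, strikes_before_16) {
  failed_both_checks || strikes_before_16 >= 2
}

exclude_participant(TRUE,  0)   # TRUE: failed both checks
exclude_participant(FALSE, 2)   # TRUE: two early strikes
exclude_participant(FALSE, 1)   # FALSE: retained
```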

Exit Survey

After completing all trials, participants completed a brief demographic survey collecting:

  - First language
  - Time learning English (if non-native)
  - Country where English was learned
  - Other languages spoken
  - Gender

Differences from Original Study

Key differences from Bradlow et al. (2023):

  - Corpus: L2-ARCTIC instead of ALLSSTAR
  - Stimuli: 30 sentences vs 60 in original
  - Noise: No added noise (original used 0 dB SNR)
  - Timing: 15-second response window with time pressure
  - Delay: No 11-hour delay between training and testing phases
  - Conditions: 6 conditions vs multiple experiments in original
  - Response format: Full sentence transcription vs keyword identification
  - Platform: Web-based (jsPsych) vs laboratory setting

Analysis Plan

Primary Analyses:

  - Mixed-effects logistic regression: Accuracy ~ Condition × Phase + (1|Participant) + (1|Item)
  - Character Error Rate (CER) as primary outcome measure, converted to Accuracy (1-CER) for interpretability
  - Planned contrasts testing adaptation benefits relative to baseline (multi-multi condition)
  - Effect sizes calculated as Cohen’s d
  - Main analyses restricted to native English speakers, with separate comparison of native vs non-native performance
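The CER-to-accuracy conversion can be sketched with base R’s `adist` (generalized Levenshtein edit distance); the actual scoring pipeline may differ in details such as capping CER at 1.

```r
# Character Error Rate: edit distance between response and target,
# normalized by target length; accuracy is its complement
cer <- function(response, target) {
  adist(response, target)[1, 1] / nchar(target)
}
accuracy <- function(response, target) 1 - cer(response, target)

cer("the cat sat", "the cat sat on the mat")  # 11 deletions / 22 chars = 0.5
```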

Secondary Analyses:

  - Speaker-level random effects to quantify talker variability
  - Learning curves across trials within each phase
  - Correlation between training and testing performance
  - Impact of L1 background on intelligibility
  - Distribution of accuracy scores across all trials
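As an illustration of the learning-curve analysis, here is a toy sketch on simulated data; the real analysis operates on the trial-level df_main built in Data Preparation.

```r
# Simulate 20 participants x 15 training trials with a small upward trend
set.seed(42)
toy <- data.frame(
  trial    = rep(1:15, each = 20),
  accuracy = pmin(1, 0.82 + 0.004 * rep(1:15, each = 20) + rnorm(300, 0, 0.05))
)

# Mean accuracy per trial, plus a linear trend as a simple summary
curve <- aggregate(accuracy ~ trial, data = toy, FUN = mean)
trend <- lm(accuracy ~ trial, data = toy)
coef(trend)["trial"]  # a positive slope indicates within-phase learning
```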

Results

Data Preparation

# Load the data
df_main_all <- read_csv("~/Documents/PerceptualAdaptation/data/df_main.csv", show_col_types = FALSE)

# Convert CER to accuracy
df_main_all <- df_main_all %>%
  mutate(accuracy = 1 - cer)

# Create native speaker indicator
df_main_all <- df_main_all %>%
  mutate(
    is_native_english = grepl("^en", tolower(first_language)),
    native_status = ifelse(is_native_english, "Native", "Non-Native")
  )

# Check distribution before filtering
native_counts_all <- df_main_all %>%
  distinct(participant_id, is_native_english) %>%
  count(is_native_english) %>%
  mutate(percentage = sprintf("%.1f%%", n/sum(n) * 100))

# FILTER TO KEEP ONLY NATIVE SPEAKERS FOR MAIN ANALYSES
df_main <- df_main_all %>%
  filter(is_native_english == TRUE)

# Define condition labels for plotting
condition_labels <- c(
  'single-single-same' = 'Same Speaker',
  'single-single-diff-same-variety' = 'Different Speaker (Same Variety)',
  'single-single-diff-diff-variety' = 'Different Speaker (Diff Variety)',
  'single-multi-excl-single' = 'Single→Multi',
  'multi-multi-all-random' = 'Multi→Multi',
  'multi-excl-single-single' = 'Multi→Single'
)

# Define condition colors
condition_colors <- c(
  'single-single-same' = '#2E86AB',
  'single-single-diff-same-variety' = '#A23B72',
  'single-single-diff-diff-variety' = '#F18F01',
  'single-multi-excl-single' = '#C73E1D',
  'multi-multi-all-random' = '#6A994E',
  'multi-excl-single-single' = '#BC4B51'
)

Data Overview:

  - Initial Prolific submissions: 1,370 (no early timeouts or attention check failures)
  - After applying exclusion criteria: 917 valid participants
  - Exclusion rate: 33.1%

Participant Language Background:

  - Native English speakers: 834 (90.9%)
  - Non-native English speakers: 83 (9.1%)

After filtering to native speakers only:

  - Dataset dimensions: 25,020 rows, 44 columns
  - Number of participants: 834
  - Number of speakers: 15
  - Average participants per condition: 139

Confirmatory Analysis

Overall Performance by Condition and Phase (Native Speakers Only)

# Calculate means by condition and phase
summary_stats <- df_main %>%
  group_by(condition, phase) %>%
  summarise(
    mean_accuracy = mean(accuracy),
    se_accuracy = sd(accuracy) / sqrt(n()),
    n = n(),
    .groups = 'drop'
  )

# Create separate tables for each phase and join
training_stats <- summary_stats %>%
  filter(phase == "Training") %>%
  mutate(
    condition_label = condition_labels[condition],
    training_acc = sprintf("%.1f%% (±%.1f%%)", mean_accuracy * 100, se_accuracy * 100)
  ) %>%
  select(condition_label, training_acc, training_n = n)

testing_stats <- summary_stats %>%
  filter(phase == "Testing") %>%
  mutate(
    condition_label = condition_labels[condition],
    testing_acc = sprintf("%.1f%% (±%.1f%%)", mean_accuracy * 100, se_accuracy * 100)
  ) %>%
  select(condition_label, testing_acc, testing_n = n)

# Join the tables
summary_table <- training_stats %>%
  left_join(testing_stats, by = "condition_label") %>%
  select(condition_label, training_acc, testing_acc, training_n, testing_n)

format_table(summary_table, 
            col.names = c("Condition", "Training Accuracy", "Testing Accuracy", "N (Training)", "N (Testing)"))
Condition                          Training Accuracy  Testing Accuracy  N (Training)  N (Testing)
Multi→Single                       86.5% (±0.4%)      88.4% (±0.3%)     2235          2235
Multi→Multi                        85.4% (±0.4%)      86.0% (±0.4%)     1965          1965
Single→Multi                       86.4% (±0.4%)      86.9% (±0.4%)     2085          2085
Different Speaker (Diff Variety)   88.4% (±0.3%)      88.5% (±0.3%)     2205          2205
Different Speaker (Same Variety)   85.5% (±0.4%)      86.3% (±0.4%)     2010          2010
Same Speaker                       88.5% (±0.4%)      89.9% (±0.3%)     2010          2010

Primary Visualization 1: Absolute Adaptation Benefit by Condition (Native Speakers)

# Calculate adaptation benefit for each participant
adaptation_data <- df_main %>%
  group_by(condition, participant_id, phase) %>%
  summarise(mean_accuracy = mean(accuracy), .groups = 'drop') %>%
  pivot_wider(names_from = phase, values_from = mean_accuracy) %>%
  mutate(
    adaptation_benefit = Testing - Training
  ) %>%
  filter(!is.na(Training) & !is.na(Testing))

# Calculate condition-level statistics
adaptation_summary <- adaptation_data %>%
  group_by(condition) %>%
  summarise(
    mean_benefit = mean(adaptation_benefit),
    se_benefit = sd(adaptation_benefit) / sqrt(n()),
    n = n(),
    raw_benefits = list(adaptation_benefit)
  ) %>%
  mutate(
    benefit_pct = mean_benefit * 100,
    se_pct = se_benefit * 100
  )

# Calculate overall mean
overall_mean <- mean(adaptation_summary$mean_benefit) * 100

# Add condition labels and arrange
adaptation_summary <- adaptation_summary %>%
  mutate(condition_label = condition_labels[condition]) %>%
  arrange(benefit_pct)

# Create the plot with absolute values
p1 <- ggplot(adaptation_summary, aes(x = reorder(condition_label, benefit_pct), 
                                     y = benefit_pct)) +
  geom_bar(stat = "identity", aes(fill = benefit_pct), 
           color = "black", linewidth = 1, alpha = 0.8) +
  geom_errorbar(aes(ymin = benefit_pct - se_pct, 
                    ymax = benefit_pct + se_pct),
                width = 0.3, linewidth = 1) +
  scale_fill_gradient2(low = "#d73027", mid = "#ffffbf", high = "#1a9850", 
                       midpoint = 0, guide = "none") +
  geom_hline(yintercept = 0, linetype = "solid", linewidth = 1) +
  geom_hline(yintercept = overall_mean, linetype = "dashed", linewidth = 1, color = "blue") +
  coord_flip() +
  labs(
    x = "",
    y = "Adaptation Benefit (%)",
    title = "Absolute Adaptation Benefit by Condition",
    subtitle = sprintf("Blue dashed line shows overall mean (%.2f%%)", overall_mean)
  ) +
  scale_y_continuous(limits = c(-2.5, 3.5)) +
  theme_minimal(base_size = 14) +
  theme(
    panel.grid.major.y = element_blank(),
    panel.grid.minor = element_blank(),
    axis.text = element_text(size = 12),
    plot.title = element_text(size = 16, face = "bold")
  )

# Add significance tests against zero
for(i in 1:nrow(adaptation_summary)) {
  row <- adaptation_summary[i,]
  benefits <- unlist(row$raw_benefits)
  if(length(benefits) > 1) {
    t_test <- t.test(benefits, mu = 0)
    if(t_test$p.value < 0.05) {
      stars <- ifelse(t_test$p.value < 0.001, "***",
                     ifelse(t_test$p.value < 0.01, "**", "*"))
      y_pos <- row$benefit_pct + sign(row$benefit_pct) * (row$se_pct + 0.2)
      p1 <- p1 + annotate("text", x = i, y = y_pos, label = stars, 
                          size = 6, fontface = "bold")
    }
  }
}

# Add sample sizes
for(i in 1:nrow(adaptation_summary)) {
  p1 <- p1 + annotate("text", x = i, y = -2.3, 
                      label = paste0("n=", adaptation_summary$n[i]), 
                      size = 3, color = "gray50")
}

print(p1)

All conditions showed positive adaptation effects (mean = 0.89%), indicating general perceptual learning across the experiment. However, the magnitude of adaptation varied considerably by condition:

Highest Adaptation: Multi→Single condition (1.89%) - Training with multiple speakers prepared listeners exceptionally well for a single novel speaker, suggesting that varied input creates robust and flexible representations.

Talker-Specific Benefit: Same Speaker condition (1.49%) - Continued exposure to the same speaker yielded substantial gains, supporting talker-specific adaptation mechanisms.

Moderate Adaptation: Same-Variety (0.79%) and Multi→Multi (0.64%) - These conditions showed reliable but modest improvements.

Minimal Adaptation: Single→Multi (0.43%) and Different-Variety (0.10%) - Limited improvement suggests difficulty generalizing from single-speaker training to multiple speakers, and little additional adaptation when the test speaker came from a different, unfamiliar L1 background.

Note: *, **, *** indicate p < .05, .01, .001 respectively (test against zero)

Primary Visualization 2: Accuracy by Single-Speaker Conditions (Native Speakers)

# Focus on single-speaker conditions for H1
h1_conditions <- c('single-single-same', 'single-single-diff-same-variety', 
                   'single-single-diff-diff-variety')

# Get testing phase trial-level data for these conditions (not aggregated)
h1_data_trials <- df_main %>%
  filter(condition %in% h1_conditions & phase == "Testing") %>%
  mutate(condition_label = condition_labels[condition])

# Calculate summary statistics for error bars
h1_summary <- h1_data_trials %>%
  group_by(condition, condition_label) %>%
  summarise(
    mean = mean(accuracy),
    se = sd(accuracy) / sqrt(n()),
    n = n(),
    .groups = 'drop'
  )

# Get participant-level data for pairwise tests
h1_data <- df_main %>%
  filter(condition %in% h1_conditions & phase == "Testing") %>%
  group_by(condition, participant_id) %>%
  summarise(mean_accuracy = mean(accuracy), .groups = 'drop')

# Perform pairwise t-tests
comparisons <- list(
  c("single-single-same", "single-single-diff-same-variety"),
  c("single-single-same", "single-single-diff-diff-variety"),
  c("single-single-diff-same-variety", "single-single-diff-diff-variety")
)

p_values <- map_dbl(comparisons, function(comp) {
  data1 <- h1_data %>% filter(condition == comp[1]) %>% pull(mean_accuracy)
  data2 <- h1_data %>% filter(condition == comp[2]) %>% pull(mean_accuracy)
  t.test(data1, data2)$p.value
})

# Create violin plot with all trial-level data
p2 <- ggplot(h1_data_trials, aes(x = condition_label, y = accuracy)) +
  geom_violin(aes(fill = condition), alpha = 0.7, scale = "width") +
  geom_jitter(alpha = 0.1, size = 0.8, width = 0.2, color = "gray30") +
  
  # Add mean and error bars
  geom_point(data = h1_summary, aes(x = condition_label, y = mean), 
             size = 4, color = "black") +
  geom_errorbar(data = h1_summary, 
                aes(x = condition_label, y = mean, ymin = mean - se, ymax = mean + se),
                width = 0.2, linewidth = 1, color = "black") +
  
  scale_fill_manual(values = condition_colors[h1_conditions], guide = "none") +
  
  labs(
    x = NULL,
    y = "Testing Phase Accuracy",
    title = "Talker-Specific Adaptation",
    subtitle = "Testing phase performance by training-test speaker relationship (each point = one trial)"
  ) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1), 
                     limits = c(0, 1),
                     breaks = seq(0, 1, 0.1)) +
  theme_minimal(base_size = 14) +
  theme(
    panel.grid.major.x = element_blank(),
    panel.grid.minor = element_blank(),
    panel.grid.major.y = element_line(color = "gray90", linewidth = 0.5),
    axis.text.x = element_text(size = 12, color = "gray20"),
    axis.text.y = element_text(size = 11, color = "gray20"),
    axis.title.y = element_text(size = 13, margin = margin(r = 10)),
    plot.title = element_text(size = 18, face = "bold", color = "gray10"),
    plot.subtitle = element_text(size = 13, color = "gray40", margin = margin(b = 15)),
    plot.background = element_rect(fill = "white", color = NA),
    panel.background = element_rect(fill = "white", color = NA)
  )

print(p2)
## Warning: Removed 1263 rows containing missing values or values outside the scale range
## (`geom_point()`).

This analysis tests whether training with a specific speaker provides advantages when tested with that same speaker versus novel speakers:

Same Speaker Advantage: The Same Speaker condition achieved 89.9% accuracy, significantly outperforming the Same-Variety condition (86.3%, p = .001). This 3.6 percentage point advantage demonstrates robust talker-specific perceptual tuning.

L1 Variety Effects: Surprisingly, there was no significant difference between Same Speaker and Different-Variety conditions (88.5%, p = .114), suggesting that L1 background may be less important than expected.

Variety Comparison (Testing H2): The Same-Variety condition performed significantly worse than the Different-Variety condition (p = .038). This directly tests H2 (variety-general adaptation) and shows the opposite of the predicted pattern: a shared L1 background hindered rather than facilitated cross-talker generalization. Note, however, that the Different-Variety condition was also more accurate during training (88.4% vs. 85.5%), so baseline talker intelligibility may contribute to this difference.

These results provide partial support for H1 (talker-specific adaptation exists but only relative to same-variety conditions) and evidence against H2 (L1 variety does not facilitate generalization as predicted).

Testing H2: Variety-General Adaptation

# Focus on conditions that test variety effects
h2_conditions <- c('single-single-diff-same-variety', 'single-single-diff-diff-variety')

# Get data for both phases to examine adaptation patterns
h2_data <- df_main %>%
  filter(condition %in% h2_conditions) %>%
  group_by(condition, participant_id, phase) %>%
  summarise(mean_accuracy = mean(accuracy), .groups = 'drop') %>%
  mutate(condition_label = condition_labels[condition])

# Get participant counts for legend
n_per_condition <- h2_data %>%
  distinct(condition, participant_id) %>%
  count(condition) %>%
  mutate(condition_label = condition_labels[condition])

# Calculate phase means for plotting
h2_summary <- h2_data %>%
  group_by(condition_label, phase) %>%
  summarise(
    mean = mean(mean_accuracy),
    se = sd(mean_accuracy) / sqrt(n()),
    n = n(),
    .groups = 'drop'
  ) %>%
  mutate(phase = factor(phase, levels = c("Training", "Testing")))

# Create interaction plot with n in legend
p_h2 <- ggplot(h2_summary, aes(x = phase, y = mean, color = condition_label, group = condition_label)) +
  geom_line(linewidth = 2) +
  geom_point(size = 4) +
  geom_errorbar(aes(ymin = mean - se, ymax = mean + se),
                width = 0.1, linewidth = 1) +
  scale_color_manual(
    values = c(
      "Different Speaker (Same Variety)" = "#A23B72",
      "Different Speaker (Diff Variety)" = "#F18F01"
    ),
    labels = c(
      "Different Speaker (Same Variety)" = paste0("Different Speaker (Same Variety) (n = ", 
                                                  n_per_condition$n[n_per_condition$condition_label == "Different Speaker (Same Variety)"], ")"),
      "Different Speaker (Diff Variety)" = paste0("Different Speaker (Diff Variety) (n = ", 
                                                  n_per_condition$n[n_per_condition$condition_label == "Different Speaker (Diff Variety)"], ")")
    ),
    name = "Condition"
  ) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1),
                     limits = c(0.84, 0.90)) +
  labs(
    x = "Phase",
    y = "Mean Accuracy",
    title = "Testing H2: Variety-General Adaptation",
    subtitle = "Does shared L1 background facilitate cross-talker generalization?"
  ) +
  theme_minimal(base_size = 14) +
  theme(
    legend.position = "top",
    panel.grid.minor = element_blank(),
    panel.grid.major = element_line(color = "gray90"),
    axis.text = element_text(size = 12),
    axis.title = element_text(size = 13),
    plot.title = element_text(size = 18, face = "bold"),
    plot.subtitle = element_text(size = 13, color = "gray40")
  )

print(p_h2)

# Calculate adaptation benefits for each condition
h2_adaptation <- h2_data %>%
  pivot_wider(names_from = phase, values_from = mean_accuracy) %>%
  mutate(adaptation = Testing - Training) %>%
  group_by(condition_label) %>%
  summarise(
    mean_adaptation = mean(adaptation, na.rm = TRUE),
    se_adaptation = sd(adaptation, na.rm = TRUE) / sqrt(n()),
    n = n()
  )

# Test difference in adaptation between conditions
# (compute the Testing - Training difference explicitly before pulling:
# pull() takes a column name, not an arithmetic expression)
same_variety_adapt <- h2_data %>%
  filter(condition == "single-single-diff-same-variety") %>%
  pivot_wider(names_from = phase, values_from = mean_accuracy) %>%
  mutate(adaptation = Testing - Training) %>%
  pull(adaptation)

diff_variety_adapt <- h2_data %>%
  filter(condition == "single-single-diff-diff-variety") %>%
  pivot_wider(names_from = phase, values_from = mean_accuracy) %>%
  mutate(adaptation = Testing - Training) %>%
  pull(adaptation)

adapt_test <- t.test(same_variety_adapt, diff_variety_adapt)

H2 Analysis Summary:

The variety-general adaptation hypothesis (H2) predicts that training on speakers from one L1 background should facilitate better generalization to new speakers from the same L1 background compared to speakers from different L1 backgrounds.

Results contradict H2:

  - Same-Variety condition: Training 85.5% → Testing 86.3% (adaptation: +0.8%)
  - Different-Variety condition: Training 88.4% → Testing 88.5% (adaptation: +0.1%)
  - Testing phase comparison: Different-Variety (88.5%) > Same-Variety (86.3%), p = .038

The Different-Variety condition maintained higher accuracy throughout and showed less need for adaptation. The significant difference in testing phase performance (p = .038) runs counter to H2’s prediction, suggesting that L1 background matching does not facilitate cross-talker generalization and may even hinder it.

Mixed-Effects Model Analysis (Native Speakers)

# Check if nlme is available for mixed models
if(has_nlme) {
  # Prepare data for mixed model
  model_data <- df_main %>%
    mutate(
      condition = factor(condition),
      phase = factor(phase),
      participant_id = factor(participant_id),
      stimulus_id = factor(stimulus_id),
      speaker_id = factor(speaker_id),
      trial_in_phase = ifelse(phase == "Training", overall_trial_number, overall_trial_number - 15)
    )

  # Fit mixed-effects model using nlme
  library(nlme)
  model <- lme(accuracy ~ condition * phase, 
               random = ~ 1 | participant_id, 
               data = model_data)

  # DETAILED MODEL OUTPUT
  cat("=== DETAILED MIXED EFFECTS MODEL OUTPUT ===\n\n")
  
  # Full model summary
  model_summary <- summary(model)
  
  # Extract and format fixed effects
  cat("FIXED EFFECTS:\n")
  cat("─────────────────────────────────────────────────────────────────────────\n")
  fixed_effects <- model_summary$tTable
  
  # Format the output with interpretable names
  effect_names <- rownames(fixed_effects)
  for(i in 1:nrow(fixed_effects)) {
    effect_name <- effect_names[i]
    coef <- fixed_effects[i, "Value"]
    se <- fixed_effects[i, "Std.Error"]
    df <- fixed_effects[i, "DF"]
    t_val <- fixed_effects[i, "t-value"]
    p_val <- fixed_effects[i, "p-value"]
    
    # Add significance stars
    sig_stars <- ifelse(p_val < 0.001, "***", 
                        ifelse(p_val < 0.01, "**", 
                               ifelse(p_val < 0.05, "*", "")))
    
    cat(sprintf("%-50s β = %7.4f (SE = %.4f), t(%d) = %6.2f, p = %.4f %s\n",
                effect_name, coef, se, df, t_val, p_val, sig_stars))
  }
  
  cat("\n")
  
  # Extract variance components
  var_comp <- VarCorr(model)
  participant_var <- as.numeric(var_comp[1,1])
  residual_var <- as.numeric(var_comp[2,1])
  total_var <- participant_var + residual_var
  
  cat("\nVARIANCE COMPONENTS:\n")
  cat("─────────────────────────────────────────────────────────────────────────\n")
  cat(sprintf("Participant (Random Intercept): σ² = %.6f (SD = %.4f)\n", 
              participant_var, sqrt(participant_var)))
  cat(sprintf("Residual:                       σ² = %.6f (SD = %.4f)\n", 
              residual_var, sqrt(residual_var)))
  cat(sprintf("Total:                          σ² = %.6f\n", total_var))
  cat(sprintf("\nIntraclass Correlation (ICC): %.3f\n", participant_var / total_var))
  cat(sprintf("  → %.1f%% of variance is between participants\n", 100 * participant_var / total_var))
  cat(sprintf("  → %.1f%% of variance is within participants\n", 100 * residual_var / total_var))
  
  # Model fit statistics
  cat("\nMODEL FIT:\n")
  cat("─────────────────────────────────────────────────────────────────────────\n")
  cat(sprintf("Log-Likelihood: %.2f\n", model_summary$logLik))
  cat(sprintf("AIC: %.1f\n", AIC(model)))
  cat(sprintf("BIC: %.1f\n", BIC(model)))
  cat(sprintf("Number of observations: %d\n", nrow(model_data)))
  cat(sprintf("Number of participants: %d\n", length(unique(model_data$participant_id))))
  
  # Calculate R² for each condition
  cat("\n\nMODEL PREDICTIONS BY CONDITION:\n")
  cat("─────────────────────────────────────────────────────────────────────────\n")
  
  # Calculate R² by condition
  r2_by_condition <- model_data %>%
    mutate(predicted = fitted(model)) %>%
    group_by(condition) %>%
    summarise(
      r2 = cor(accuracy, predicted)^2,
      rmse = sqrt(mean((accuracy - predicted)^2)),
      n = n()
    ) %>%
    mutate(condition_label = condition_labels[condition])
  
  for(i in 1:nrow(r2_by_condition)) {
    cat(sprintf("%-40s R² = %.3f, RMSE = %.4f (n = %d)\n",
                r2_by_condition$condition_label[i],
                r2_by_condition$r2[i],
                r2_by_condition$rmse[i],
                r2_by_condition$n[i]))
  }
  
  # Overall R²
  overall_r2 <- cor(model_data$accuracy, fitted(model))^2
  cat(sprintf("\nOverall R²: %.3f\n", overall_r2))
  
  # Model predictions vs actual plot with R² annotations
  model_data$predicted <- fitted(model)
  
  # Calculate R² for each condition for the plot
  r2_data <- model_data %>%
    group_by(condition) %>%
    summarise(r2 = cor(accuracy, predicted)^2) %>%
    mutate(
      condition_label = condition_labels[condition],
      r2_label = sprintf("R² = %.3f", r2)
    )
  
# Subsample participants (keep every third) to reduce overplotting
  plot_data <- model_data %>%
    group_by(participant_id) %>%
    mutate(participant_num = cur_group_id()) %>%
    ungroup() %>%
    filter(participant_num %% 3 == 0) %>%
    mutate(condition_label = condition_labels[condition])
  
  # Create prediction plot with R² values
  pred_plot <- ggplot(plot_data, aes(x = accuracy, y = predicted)) +
    geom_point(alpha = 0.3, size = 1.5, color = "#2E86AB") +
    geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "red", linewidth = 1) +
    geom_text(data = r2_data, aes(x = 0.4, y = 0.95, label = r2_label),
              hjust = 0, vjust = 1, size = 4, fontface = "bold", color = "red") +
    facet_wrap(~ condition_label, nrow = 2) +
    scale_x_continuous(labels = scales::percent_format(accuracy = 1), 
                       limits = c(0.3, 1)) +
    scale_y_continuous(labels = scales::percent_format(accuracy = 1), 
                       limits = c(0.3, 1)) +
    labs(
      x = "Actual Accuracy",
      y = "Predicted Accuracy",
      title = "Mixed Effects Model: Predicted vs Actual Accuracy",
      subtitle = "Red dashed line represents perfect prediction (Native speakers only)"
    ) +
    theme_minimal(base_size = 12) +
    theme(
      strip.text = element_text(face = "bold", size = 10),
      panel.spacing = unit(1, "lines"),
      panel.grid.minor = element_blank(),
      panel.grid.major = element_line(color = "gray95"),
      plot.title = element_text(size = 16, face = "bold"),
      plot.subtitle = element_text(size = 12, color = "gray40"),
      aspect.ratio = 1
    )
  
  print(pred_plot)
  
} else {
  cat("Note: nlme package not available. Mixed-effects analysis skipped.\n")
  cat("To run this analysis, install the package with:\n")
  cat("install.packages('nlme')\n")
}
## === DETAILED MIXED EFFECTS MODEL OUTPUT ===
## 
## FIXED EFFECTS:
## ─────────────────────────────────────────────────────────────────────────
## (Intercept)                                        β =  0.8839 (SE = 0.0070), t(24180) = 126.67, p = 0.0000 ***
## conditionmulti-multi-all-random                    β = -0.0238 (SE = 0.0102), t(828) =  -2.33, p = 0.0198 *
## conditionsingle-multi-excl-single                  β = -0.0151 (SE = 0.0100), t(828) =  -1.50, p = 0.1328 
## conditionsingle-single-diff-diff-variety           β =  0.0014 (SE = 0.0099), t(828) =   0.14, p = 0.8916 
## conditionsingle-single-diff-same-variety           β = -0.0207 (SE = 0.0101), t(828) =  -2.04, p = 0.0419 *
## conditionsingle-single-same                        β =  0.0156 (SE = 0.0101), t(828) =   1.54, p = 0.1247 
## phaseTraining                                      β = -0.0189 (SE = 0.0045), t(24180) =  -4.22, p = 0.0000 ***
## conditionmulti-multi-all-random:phaseTraining      β =  0.0126 (SE = 0.0065), t(24180) =   1.92, p = 0.0549 
## conditionsingle-multi-excl-single:phaseTraining    β =  0.0147 (SE = 0.0064), t(24180) =   2.28, p = 0.0229 *
## conditionsingle-single-diff-diff-variety:phaseTraining β =  0.0179 (SE = 0.0064), t(24180) =   2.82, p = 0.0048 **
## conditionsingle-single-diff-same-variety:phaseTraining β =  0.0110 (SE = 0.0065), t(24180) =   1.70, p = 0.0899 
## conditionsingle-single-same:phaseTraining          β =  0.0040 (SE = 0.0065), t(24180) =   0.62, p = 0.5369 
## 
## 
## VARIANCE COMPONENTS:
## ─────────────────────────────────────────────────────────────────────────
## Participant (Random Intercept): σ² = 0.005759 (SD = 0.0759)
## Residual:                       σ² = 0.022420 (SD = 0.1497)
## Total:                          σ² = 0.028179
## 
## Intraclass Correlation (ICC): 0.204
##   → 20.4% of variance is between participants
##   → 79.6% of variance is within participants
## 
## MODEL FIT:
## ─────────────────────────────────────────────────────────────────────────
## Log-Likelihood: 11061.29
## AIC: -22094.6
## BIC: -21980.8
## Number of observations: 25020
## Number of participants: 834
## 
## 
## MODEL PREDICTIONS BY CONDITION:
## ─────────────────────────────────────────────────────────────────────────
## Same Speaker                             R² = 0.176, RMSE = 0.1483 (n = 4470)
## Different Speaker (Same Variety)         R² = 0.208, RMSE = 0.1574 (n = 3930)
## Different Speaker (Diff Variety)         R² = 0.328, RMSE = 0.1477 (n = 4170)
## Single→Multi                           R² = 0.147, RMSE = 0.1424 (n = 4410)
## Multi→Multi                            R² = 0.254, RMSE = 0.1572 (n = 4020)
## Multi→Single                           R² = 0.249, RMSE = 0.1309 (n = 4020)
## 
## Overall R²: 0.235
## Warning: Removed 86 rows containing missing values or values outside the scale range
## (`geom_point()`).

Model Interpretation

The mixed-effects model reveals several key findings:

Baseline Performance: The intercept indicates that the reference condition (multi-excl-single-single in Testing phase) achieved 88.4% accuracy.

Phase Effect: Accuracy was significantly lower in the Training phase than in the Testing phase (-1.9 percentage points, p < .001). This negative coefficient indicates that participants improved from training to testing, consistent with perceptual learning.

Condition Differences: The Multi→Multi condition performed significantly worse than the reference (-2.4%, p = .02), while the Same-Variety condition also showed lower performance (-2.1%, p = .04).
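As a sanity check, predicted cell means can be reconstructed directly from the fixed effects; a minimal sketch using coefficients copied from the model output above:

```r
# Predicted accuracies from the reported fixed effects
# (reference condition = multi-excl-single-single, reference phase = Testing)
b0      <- 0.8839    # intercept
b_train <- -0.0189   # phaseTraining
b_mm    <- -0.0238   # conditionmulti-multi-all-random
round(100 * c(
  reference_testing   = b0,             # 88.4%
  reference_training  = b0 + b_train,   # 86.5%
  multi_multi_testing = b0 + b_mm       # 86.0%
), 1)
```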

Variance Structure: The ICC of 0.204 indicates substantial individual differences, with 20.4% of variance attributable to between-participant differences and 79.6% to within-participant variation (trial-to-trial variability).
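The ICC calculation itself is simple; a quick check using the variance components reported above:

```r
# ICC = between-participant variance / total variance
participant_var <- 0.005759
residual_var    <- 0.022420
icc <- participant_var / (participant_var + residual_var)
round(icc, 3)  # 0.204
```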

Model Fit: The overall R² of 0.235 suggests the model explains about 24% of the variance in accuracy. R² values varied considerably by condition, from 0.147 (Single→Multi) to 0.328 (Different Speaker, Diff Variety), indicating that the model’s predictive accuracy differs across experimental conditions.

Learning Curves

# Calculate trial-by-trial accuracy for each condition (native speakers only)
trial_data <- df_main %>%
  group_by(condition, overall_trial_number) %>%
  summarise(
    mean_accuracy = mean(accuracy),
    se = sd(accuracy) / sqrt(n()),
    n = n(),
    .groups = 'drop'
  ) %>%
  mutate(
    condition_label = condition_labels[condition],
    phase = ifelse(overall_trial_number <= 15, "Training", "Testing")
  )

# Create enhanced faceted plot
p3 <- ggplot(trial_data, aes(x = overall_trial_number, y = mean_accuracy)) +
  geom_ribbon(aes(ymin = mean_accuracy - se, ymax = mean_accuracy + se, 
                  fill = condition), alpha = 0.2) +
  geom_line(aes(color = condition), linewidth = 1.2) +
  geom_point(aes(color = condition), size = 1.8, alpha = 0.8) +
  geom_vline(xintercept = 15.5, linetype = "dashed", alpha = 0.4, linewidth = 0.8) +
  facet_wrap(~ condition_label, nrow = 2) +
  scale_color_manual(values = condition_colors, guide = "none") +
  scale_fill_manual(values = condition_colors, guide = "none") +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1), 
                     limits = c(0.7, 1),
                     breaks = seq(0.7, 1, 0.05)) +
  labs(
    x = "Trial Number",
    y = "Accuracy",
    title = "Learning Curves by Condition",
    subtitle = "Vertical line indicates training-testing phase transition"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    strip.text = element_text(face = "bold", size = 11, margin = margin(b = 5)),
    strip.background = element_rect(fill = "gray97", color = NA),
    panel.spacing = unit(1.2, "lines"),
    panel.grid.minor = element_blank(),
    panel.grid.major = element_line(color = "gray95", linewidth = 0.5),
    axis.text = element_text(size = 10, color = "gray20"),
    axis.title = element_text(size = 12),
    plot.title = element_text(size = 18, face = "bold", color = "gray10"),
    plot.subtitle = element_text(size = 13, color = "gray40", margin = margin(b = 15)),
    plot.background = element_rect(fill = "white", color = NA),
    panel.background = element_rect(fill = "white", color = NA)
  )

print(p3)

Native vs Non-Native Speaker Comparison

# Using the unfiltered data for native vs non-native comparison
# Calculate overall performance by native status and phase
overall_comparison <- df_main_all %>%
  group_by(native_status, phase) %>%
  summarise(
    mean_accuracy = mean(accuracy),
    se_accuracy = sd(accuracy) / sqrt(n()),
    n = n(),
    .groups = 'drop'
  ) %>%
  mutate(
    phase = factor(phase, levels = c("Training", "Testing"))  # Ensure correct order
  )

# Get participant counts for legend
n_native <- n_distinct(df_main_all %>% filter(native_status == "Native") %>% pull(participant_id))
n_nonnative <- n_distinct(df_main_all %>% filter(native_status == "Non-Native") %>% pull(participant_id))

# Store values for interpretation
native_train <- overall_comparison %>% 
  filter(native_status == "Native" & phase == "Training") %>% 
  pull(mean_accuracy) * 100
native_test <- overall_comparison %>% 
  filter(native_status == "Native" & phase == "Testing") %>% 
  pull(mean_accuracy) * 100
nonnative_train <- overall_comparison %>% 
  filter(native_status == "Non-Native" & phase == "Training") %>% 
  pull(mean_accuracy) * 100
nonnative_test <- overall_comparison %>% 
  filter(native_status == "Non-Native" & phase == "Testing") %>% 
  pull(mean_accuracy) * 100

# Test for statistical difference
native_data <- df_main_all %>% filter(native_status == "Native")
nonnative_data <- df_main_all %>% filter(native_status == "Non-Native")

# Store t-test results
train_t <- t.test(
  native_data %>% filter(phase == "Training") %>% pull(accuracy),
  nonnative_data %>% filter(phase == "Training") %>% pull(accuracy)
)
test_t <- t.test(
  native_data %>% filter(phase == "Testing") %>% pull(accuracy),
  nonnative_data %>% filter(phase == "Testing") %>% pull(accuracy)
)

# Create comparison plot with updated legend
p_comparison <- ggplot(overall_comparison, aes(x = phase, y = mean_accuracy, 
                                               color = native_status, group = native_status)) +
  geom_line(linewidth = 2) +
  geom_point(size = 4) +
  geom_errorbar(aes(ymin = mean_accuracy - se_accuracy, 
                    ymax = mean_accuracy + se_accuracy),
                width = 0.1, linewidth = 1) +
  scale_color_manual(
    values = c("Native" = "#2E86AB", "Non-Native" = "#F18F01"),
    labels = c(
      "Native" = paste0("Native (n = ", n_native, ")"),
      "Non-Native" = paste0("Non-Native (n = ", n_nonnative, ")")
    ),
    name = "Speaker Status"
  ) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1),
                     limits = c(0.8, 0.9)) +
  labs(
    x = "Phase",
    y = "Mean Accuracy",
    title = "Overall Performance: Native vs Non-Native English Speakers",
    subtitle = "Error bars represent standard error"
  ) +
  theme_minimal(base_size = 14) +
  theme(
    legend.position = "top",
    panel.grid.minor = element_blank(),
    panel.grid.major = element_line(color = "gray90"),
    axis.text = element_text(size = 12),
    axis.title = element_text(size = 13),
    plot.title = element_text(size = 18, face = "bold"),
    plot.subtitle = element_text(size = 13, color = "gray40")
  )

print(p_comparison)

The comparison between native and non-native English speakers reveals striking differences in both performance levels and adaptation patterns:

Baseline Performance: Native speakers significantly outperformed non-native speakers in both phases. Training: 86.8% vs 84.0% (difference = 2.8%, t = 5.16, p < .001). Testing: 87.7% vs 83.5% (difference = 4.2%, t = 7.54, p < .001).

Adaptation Patterns: While native speakers showed positive adaptation (+0.9 percentage points), non-native speakers showed a slight decline (-0.5 percentage points). This divergent pattern suggests the two groups may process and adapt to L2-accented speech in different ways.

Implications: The performance gap widened from training to testing, indicating that the experimental manipulation may have been more challenging for non-native speakers, possibly due to increased cognitive load or less flexible perceptual adaptation mechanisms.
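The widening of the gap is a simple difference-in-differences on the group means reported above:

```r
# Native minus non-native accuracy gap, per phase (percentage points)
gap_training <- 86.8 - 84.0  # 2.8 pp
gap_testing  <- 87.7 - 83.5  # 4.2 pp
gap_testing - gap_training   # the gap widens by 1.4 pp
```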

Distribution of Trial Accuracies

# Create histogram of all trial accuracies (native speakers only)
p_hist <- ggplot(df_main, aes(x = accuracy)) +
  geom_histogram(binwidth = 0.05, fill = "#2E86AB", alpha = 0.8, 
                 color = "white", boundary = 0) +
  scale_x_continuous(labels = scales::percent_format(accuracy = 1),
                     breaks = seq(0, 1, 0.1)) +
  scale_y_continuous(expand = c(0, 0)) +
  labs(
    x = "Accuracy",
    y = "Number of Trials",
    title = "Distribution of Trial Accuracies",
    subtitle = sprintf("All trials from native English speakers (n = %d trials)", nrow(df_main))
  ) +
  theme_minimal(base_size = 14) +
  theme(
    panel.grid.minor = element_blank(),
    panel.grid.major.x = element_blank(),
    panel.grid.major.y = element_line(color = "gray90"),
    axis.text = element_text(size = 11),
    axis.title = element_text(size = 13),
    plot.title = element_text(size = 18, face = "bold"),
    plot.subtitle = element_text(size = 13, color = "gray40")
  )

print(p_hist)

# Calculate and store statistics
acc_mean <- mean(df_main$accuracy) * 100
acc_median <- median(df_main$accuracy) * 100
acc_sd <- sd(df_main$accuracy) * 100
acc_min <- min(df_main$accuracy) * 100
acc_max <- max(df_main$accuracy) * 100
perfect_trials <- sum(df_main$accuracy == 1)
perfect_pct <- 100 * perfect_trials / nrow(df_main)

The distribution of trial accuracies reveals several important characteristics of L2 speech perception performance:

Central Tendency: The mean accuracy of 87.3% with a median of 93.3% indicates a left-skewed (negatively skewed) distribution: most trials were highly accurate, while a smaller set of particularly challenging trials formed a long left tail that pulled the mean below the median.
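When the mean falls below the median, the distribution has a longer left tail; this signature of ceiling-bounded accuracy data is easy to illustrate with simulated values (a sketch; the Beta parameters are arbitrary):

```r
# Simulated accuracy-like data bounded at 1 with a long left tail:
# the mean falls below the median, as in the trial-accuracy distribution above
set.seed(1)
x <- rbeta(10000, shape1 = 8, shape2 = 1.5)  # piles up near 1
mean(x) < median(x)  # TRUE
```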

Variability: The standard deviation of 16.8% demonstrates substantial trial-to-trial variability, reflecting the diverse challenges posed by different speakers, sentences, and experimental conditions.

Ceiling Effects: Remarkably, 39.7% of trials (9,940 trials) achieved perfect accuracy, suggesting that many L2-accented sentences were fully intelligible to native English listeners despite the accent.

Range: Accuracy ranged from 0.0% to 100.0%, with the presence of zero-accuracy trials indicating complete failures of speech perception for certain speaker-sentence combinations.

Speaker Effects

# Calculate speaker-level statistics (based on native speaker responses)
speaker_stats <- df_main %>%
  group_by(speaker_id) %>%
  summarise(
    mean_accuracy = mean(accuracy),
    se = sd(accuracy) / sqrt(n()),
    n = n()
  ) %>%
  arrange(desc(mean_accuracy))

# Create enhanced speaker barplot
p4 <- ggplot(speaker_stats, aes(x = reorder(speaker_id, mean_accuracy), y = mean_accuracy)) +
  geom_bar(stat = "identity", aes(fill = mean_accuracy), alpha = 0.85, width = 0.8) +
  geom_errorbar(aes(ymin = mean_accuracy - se, ymax = mean_accuracy + se),
                width = 0.2, linewidth = 0.6, color = "gray30") +
  geom_hline(yintercept = mean(df_main$accuracy), 
             linetype = "dashed", color = "#E63946", linewidth = 1) +
  scale_fill_gradient2(low = "#2E86AB", mid = "#F77F00", high = "#06D6A0",
                       midpoint = mean(df_main$accuracy),
                       guide = "none") +
  coord_flip() +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1),
                     breaks = seq(0.7, 1, 0.05)) +
  labs(
    x = NULL,
    y = "Mean Accuracy",
    title = "Speaker Intelligibility Ranking",
    subtitle = "Red dashed line indicates grand mean across all speakers"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    panel.grid.major.y = element_blank(),
    panel.grid.minor = element_blank(),
    panel.grid.major.x = element_line(color = "gray90", linewidth = 0.5),
    axis.text.y = element_text(size = 10, color = "gray20"),
    axis.text.x = element_text(size = 10, color = "gray20"),
    axis.title.x = element_text(size = 12, margin = margin(t = 10)),
    plot.title = element_text(size = 16, face = "bold", color = "gray10"),
    plot.subtitle = element_text(size = 12, color = "gray40", margin = margin(b = 10)),
    plot.background = element_rect(fill = "white", color = NA),
    panel.background = element_rect(fill = "white", color = NA)
  )

print(p4)

# Calculate statistics
n_speakers <- nrow(speaker_stats)
min_acc <- min(speaker_stats$mean_accuracy) * 100
max_acc <- max(speaker_stats$mean_accuracy) * 100
speaker_var <- var(speaker_stats$mean_accuracy)

The analysis of 15 L2 speakers reveals substantial individual differences in intelligibility:

Range: Speaker intelligibility varied from 74.7% to 93.1%, representing an 18.4 percentage point spread. This wide range underscores the importance of speaker selection in L2 speech perception research.

Variance: The speaker variance of 0.0025 indicates that individual speaker characteristics contribute substantially to overall performance variability, beyond the effects of experimental condition.
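For scale, the reported between-speaker variance translates to a standard deviation of about five percentage points in mean intelligibility:

```r
# Between-speaker SD implied by the variance reported above
speaker_var <- 0.0025
sqrt(speaker_var) * 100  # 5 percentage points
```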

Implications: These speaker effects suggest that perceptual adaptation to L2 speech may be heavily influenced by the specific acoustic-phonetic characteristics of individual talkers, rather than general properties of L1-influenced speech patterns.

Summary of Key Findings

# Overall statistics
overall_stats <- df_main %>%
  group_by(phase) %>%
  summarise(
    mean_accuracy = mean(accuracy),
    sd_accuracy = sd(accuracy),
    n = n()
  )

# Store key values
n_participants <- n_distinct(df_main$participant_id)
n_excluded <- n_distinct(df_main_all$participant_id) - n_distinct(df_main$participant_id)
n_trials <- nrow(df_main)
overall_acc <- mean(df_main$accuracy) * 100
overall_sd <- sd(df_main$accuracy) * 100
train_acc <- overall_stats$mean_accuracy[overall_stats$phase == "Training"] * 100
test_acc <- overall_stats$mean_accuracy[overall_stats$phase == "Testing"] * 100

# Adaptation cost analysis
costs <- adaptation_summary %>%
  filter(condition %in% c("single-multi-excl-single", "multi-excl-single-single")) %>%
  pull(mean_benefit)

# Note: p_values[3] contains the Same-Variety vs Different-Variety comparison (H2 test)

Experiment Summary

Sample Characteristics:

  • Analyzed: 834 native English speakers
  • Excluded: 83 non-native speakers (separate analysis)
  • Total trials: 25,020 (native speakers only)
  • Conditions: 6 experimental conditions (~139 participants each)

Overall Performance:

  • Grand mean accuracy: 87.3% (SD = 16.8%)
  • Training phase: 86.8%
  • Testing phase: 87.7%
  • Overall adaptation benefit: +0.9 percentage points

Hypothesis Tests:

H1 PARTIALLY SUPPORTED: Talker-specific adaptation found

  • Evidence: the Same Speaker condition outperformed the Same-Variety condition
  • Same vs. Same-Variety: p = .001
  • Same vs. Diff-Variety: p = .114 (not significant)

H2 NOT SUPPORTED: Variety-general adaptation not found

  • Different-Variety (88.5%) outperformed Same-Variety (86.3%), p = .038
  • This is opposite to the predicted direction
  • Shared L1 background hindered rather than facilitated generalization

H3 NOT CLEARLY SUPPORTED: Specialization patterns unclear

  • Single→Multi adaptation: 0.43% (modest benefit, not cost)
  • Multi→Single adaptation: 1.89% (strong benefit)
  • Both conditions showed benefits rather than the expected cost-benefit tradeoff

Key Insights:

  1. All conditions showed positive adaptation effects
  2. Multi→Single training showed the strongest benefits (1.89%)
  3. Evidence against variety-general adaptation (H2): different L1 > same L1
  4. Substantial individual differences (ICC = 0.204)
  5. Wide speaker intelligibility range (74.7% to 93.1%)
  6. Native speakers significantly outperformed non-native speakers
  7. Nearly 40% of trials achieved perfect accuracy

Discussion

Summary of Replication Attempt

This partial replication of Bradlow et al. (2023) examined perceptual adaptation to L2-accented speech across six experimental conditions manipulating speaker variability during training and testing phases. From 1,370 complete Prolific submissions, 917 participants (33.1% exclusion rate) met our preregistered inclusion criteria. Our main analyses focus on the 834 native English speakers (90.9% of the valid sample), with a separate comparison examining differences between native and non-native listeners. This approach ensures our findings are directly comparable to the original study’s focus on L1 English speakers while also providing insights into how language background affects perceptual adaptation.

Our primary finding supports the original study’s conclusion that exposure configuration significantly impacts perceptual adaptation among native English speakers. Notably, we observed an overall improvement from training (86.8%) to testing (87.7%) phases, indicating general perceptual learning across the experiment. The absolute adaptation benefit analysis revealed that all conditions showed positive adaptations, though the magnitude varied considerably (ranging from 0.10% to 1.89%).

Key findings include:

  • The Multi→Single condition showed the highest adaptation benefit (1.89%), suggesting that training with multiple speakers creates particularly robust representations that transfer well to novel single speakers
  • The Same Speaker condition showed strong adaptation (1.49%), providing partial support for talker-specific adaptation (H1)
  • Surprisingly, the Same-Variety condition performed worse than the Different-Variety condition, contradicting the variety-general adaptation hypothesis (H2)

This last finding is particularly intriguing as it suggests that matched L1 backgrounds may create interference rather than facilitation in cross-talker generalization, possibly due to listeners forming overly specific expectations about L1-influenced speech patterns.

Commentary

Several key insights emerge from our replication focusing on native English speakers:

  1. Talker-Specific Adaptation (H1): We found mixed evidence for talker-specific benefits among native speakers. While the same-speaker condition significantly outperformed the same-variety condition, it did not significantly differ from the different-variety condition. This partial support suggests that talker-specific adaptation exists but may be more nuanced than originally hypothesized.

  2. Variety-General Effects (H2): Our explicit test of H2 yielded surprising results that contradict the hypothesis. The different-variety condition (88.5%) significantly outperformed the same-variety condition (86.3%, p = .038), suggesting that shared L1 background actually hindered rather than facilitated cross-talker generalization. This unexpected finding challenges fundamental assumptions about how listeners use linguistic similarity in perceptual adaptation.

  3. Specialization Tradeoffs (H3): Our results do not support the predicted cost-benefit tradeoff. Both Single→Multi and Multi→Single conditions showed positive adaptation benefits, with Multi→Single showing particularly strong gains (1.89%). This suggests that multiple-speaker training may create more flexible representations without incurring specialization costs.

  4. Native vs Non-Native Differences: Our comparison revealed that non-native English speakers showed consistently lower accuracy across both training and testing phases, with a slight negative adaptation trend (-0.5%) compared to native speakers’ positive adaptation (+0.9%). This suggests fundamental differences in L2 speech processing that persist even with perceptual adaptation opportunities.

  5. Methodological Considerations: Our use of the L2-ARCTIC corpus, while different from the original ALLSSTAR materials, provided well-controlled stimuli with balanced speaker representation. The addition of time pressure (15-second response window) may have increased cognitive load but also provided a more naturalistic listening scenario.

  6. Speaker Variability: The substantial speaker effects observed (with mean accuracy ranging from approximately 75% to 93% across speakers) highlight the importance of controlling for talker characteristics in L2 speech perception research.

Limitations and Future Directions

Several important limitations should be noted:

  • Reduced statistical power: with only 139 participants per condition (vs. the planned 200), our power to detect medium effects dropped from 84% to 72%
  • Lack of consolidation period: the absence of the 11-hour delay between training and testing phases may have affected the magnitude and nature of adaptation effects
  • Different task demands: full sentence transcription (vs. keyword identification) likely taps different cognitive processes and may explain some divergence from the original findings
  • Web-based format: while allowing for larger sample sizes, this reduced experimental control compared to laboratory settings
  • Conceptual rather than direct replication: the numerous methodological differences (corpus, noise conditions, time pressure, response format) mean this study tests similar concepts rather than directly replicating the original
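The cited power figures can be roughly reproduced with a normal approximation to a two-sided, two-sample t-test, using the adjusted medium effect size (d = 0.30) from the power analysis (a sketch; the simulated power values will differ slightly):

```r
# Approximate power of a two-sided, two-sample t-test (normal approximation)
power_approx <- function(n_per_group, d, alpha = 0.05) {
  ncp <- d * sqrt(n_per_group / 2)        # noncentrality parameter
  pnorm(ncp - qnorm(1 - alpha / 2))
}
round(power_approx(200, 0.30), 2)  # planned n: ~0.85
round(power_approx(139, 0.30), 2)  # achieved n: ~0.71
```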

Future work should explore the time course of adaptation with multiple testing intervals, investigate whether these adaptation patterns hold for more naturalistic conversational speech materials, and examine why shared L1 background unexpectedly hindered rather than facilitated cross-talker generalization.

This conceptual replication provides evidence that perceptual adaptation to L2 speech is influenced by the variability of training exposure among native English speakers, though the patterns are more complex than originally hypothesized. Most notably, we found evidence against the variety-general adaptation hypothesis (H2), with shared L1 background actually hindering cross-talker generalization. The unexpected finding that multiple-speaker training led to the strongest adaptation benefits challenges assumptions about specialization costs. The observed differences between native and non-native listeners further suggest that adaptation mechanisms may operate differently depending on listeners’ linguistic backgrounds, warranting future investigation into the interaction between L1 experience and perceptual flexibility.

Extension: Putting the Four Core Findings of Bradlow et al. (2023) to the Test

In this extension we ask whether our replication reproduces the four headline patterns reported by Bradlow, Bassard & Paller (2023).
All statistics are based on the native‑speaker subset used above.

1 Low‑ vs High‑Variability Training

Original claim: Single-talker (low-variability) exposure can be sufficient for cross-talker adaptation, and multi-talker (high-variability) exposure does not always guarantee it.

library(lme4)

## Tag each TEST trial as "generalize" (new talker) or not.
## Note: the lag() heuristic for single-single conditions assumes trials are
## ordered within participant, so a talker change marks a generalization trial.
df_claim1 <- df_main %>%
  mutate(generalize = case_when(
    phase == "Training"                   ~ NA,
    grepl("^single-single", condition)    ~ speaker_id != lag(speaker_id),
    condition == "multi-multi-all-random" ~ FALSE,
    condition %in% c("multi-excl-single-single",
                     "single-multi-excl-single") ~ TRUE,
    TRUE                                  ~ FALSE
  ))

## Mixed‑effects logistic model: Test‑phase only
m_claim1 <- glmer(
  accuracy ~ generalize *
    (grepl("^single", condition) %>% factor(levels = c(FALSE, TRUE))) +
    (1|participant_id) + (1|stimulus_id),
  data   = filter(df_claim1, phase == "Testing"),
  family = binomial
)
summary(m_claim1)
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: 
## accuracy ~ generalize * (grepl("^single", condition) %>% factor(levels = c(FALSE,  
##     TRUE))) + (1 | participant_id) + (1 | stimulus_id)
##    Data: filter(df_claim1, phase == "Testing")
## 
##      AIC      BIC   logLik deviance df.resid 
##   4583.0   4627.6  -2285.5   4571.0    12504 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -4.8646 -0.6892 -0.0927  0.2056  0.2505 
## 
## Random effects:
##  Groups         Name        Variance Std.Dev.
##  participant_id (Intercept) 0        0       
##  stimulus_id    (Intercept) 0        0       
## Number of obs: 12510, groups:  participant_id, 834; stimulus_id, 450
## 
## Fixed effects:
##                                                                                    Estimate
## (Intercept)                                                                          2.7688
## generalizeTRUE                                                                       0.4496
## grepl("^single", condition) %>% factor(levels = c(FALSE, TRUE))TRUE                  0.3951
## generalizeTRUE:grepl("^single", condition) %>% factor(levels = c(FALSE, TRUE))TRUE  -0.6753
##                                                                                    Std. Error
## (Intercept)                                                                            0.0957
## generalizeTRUE                                                                         0.1458
## grepl("^single", condition) %>% factor(levels = c(FALSE, TRUE))TRUE                    0.1161
## generalizeTRUE:grepl("^single", condition) %>% factor(levels = c(FALSE, TRUE))TRUE     0.1855
##                                                                                    z value
## (Intercept)                                                                         28.931
## generalizeTRUE                                                                       3.084
## grepl("^single", condition) %>% factor(levels = c(FALSE, TRUE))TRUE                  3.403
## generalizeTRUE:grepl("^single", condition) %>% factor(levels = c(FALSE, TRUE))TRUE  -3.640
##                                                                                    Pr(>|z|)
## (Intercept)                                                                         < 2e-16
## generalizeTRUE                                                                     0.002042
## grepl("^single", condition) %>% factor(levels = c(FALSE, TRUE))TRUE                0.000666
## generalizeTRUE:grepl("^single", condition) %>% factor(levels = c(FALSE, TRUE))TRUE 0.000273
##                                                                                       
## (Intercept)                                                                        ***
## generalizeTRUE                                                                     ** 
## grepl("^single", condition) %>% factor(levels = c(FALSE, TRUE))TRUE                ***
## generalizeTRUE:grepl("^single", condition) %>% factor(levels = c(FALSE, TRUE))TRUE ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) gnTRUE g(c%f=cT
## generlzTRUE -0.656                
## g("^"c%f=cT -0.824  0.541         
## gTRUEc%f=cT  0.516 -0.786 -0.626  
## optimizer (Nelder_Mead) convergence code: 0 (OK)
## boundary (singular) fit: see help('isSingular')

Interpretation

  • generalizeTRUE is positive and significant after multi-talker training: listeners profit from variability (a new test talker raises predicted accuracy by roughly 2 percentage points).
  • The negative interaction shows that after single-talker exposure this generalization gain disappears (the predicted effect of a new talker falls to roughly -1 percentage point).
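To put the logit coefficients on the accuracy scale, the four cells implied by the fixed effects can be back-transformed with the inverse logit (coefficients copied from the model output above):

```r
# Predicted accuracy for each training-type x talker-novelty cell
b <- c(intercept = 2.7688, gen = 0.4496, single = 0.3951, interact = -0.6753)
cells <- c(
  multi_same_talker  = plogis(b[["intercept"]]),
  multi_new_talker   = plogis(b[["intercept"]] + b[["gen"]]),
  single_same_talker = plogis(b[["intercept"]] + b[["single"]]),
  single_new_talker  = plogis(sum(b))
)
round(100 * cells, 1)
# multi:  94.1 -> 96.2 (generalization gain of about +2 pp)
# single: 95.9 -> 95.0 (generalization effect of about -1 pp)
```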

Replication verdict – Claim 1: Partially replicated. High variability again helps, but—unlike Bradlow et al.—our single‑talker training was not sufficient for equal cross‑talker gains.


2 Talker‑Specific Advantage

Original claim: Matched training-testing talker pairs do not always beat mismatched pairs; the advantage is inconsistent.

single_levels <- c("single-single-same",
                   "single-single-diff-same-variety",
                   "single-single-diff-diff-variety")

adapt_single <- adaptation_data %>%
  filter(condition %in% single_levels) %>%
  mutate(cond = factor(condition, levels = single_levels))

# Compare participant-level adaptation scores; the condition-level
# summary holds only one mean per cell, so a t-test there is undefined.
pairwise.t.test(adapt_single$adaptation_benefit,
                adapt_single$cond,
                p.adjust.method = "bonferroni")
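
The Bonferroni step simply multiplies each raw p‑value by the number of comparisons, capping at 1; a minimal sketch with illustrative p‑values (not from the analysis above):

```r
# Illustrative raw p-values (placeholders, not results from this study)
raw_p <- c(0.0004, 0.020, 0.400)

# Bonferroni: p * 3, capped at 1
p.adjust(raw_p, method = "bonferroni")  # 0.0012 0.0600 1.0000
```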

Interpretation

  • Same Speaker > Same Variety (Δ = 3.6 pp, p = .001).
  • Same Speaker vs. Diff Variety: n.s.

Replication verdict – Claim 2: Partially replicated. A talker‑specific boost appears, but it is selective, echoing the original pattern.


3 Symmetry of Generalization

Original claim Generalization was asymmetric: A→B ≠ B→A for some pairs.

## Participant‑level adaptation scores (already in adaptation_data)
adapt_participant <- adaptation_data %>%
  select(condition, participant_id, adaptation_benefit) %>%
  filter(!is.na(adaptation_benefit))

symmetry_pairs <- list(
  c("single-multi-excl-single",  "multi-excl-single-single"),
  c("single-single-diff-same-variety", "single-single-diff-diff-variety")
)

for (p in symmetry_pairs) {
  cat("\n── Pair:", paste(p, collapse = "  ↔  "), "──\n")
  A <- adapt_participant %>% filter(condition == p[1]) %>% pull(adaptation_benefit)
  B <- adapt_participant %>% filter(condition == p[2]) %>% pull(adaptation_benefit)
  print(t.test(A, B))
}
## 
## ── Pair: single-multi-excl-single  ↔  multi-excl-single-single ──
## 
##  Welch Two Sample t-test
## 
## data:  A and B
## t = -1.6003, df = 279.37, p-value = 0.1107
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.032719509  0.003375682
## sample estimates:
##   mean of x   mean of y 
## 0.004250124 0.018922037 
## 
## 
## ── Pair: single-single-diff-same-variety  ↔  single-single-diff-diff-variety ──
## 
##  Welch Two Sample t-test
## 
## data:  A and B
## t = 0.65941, df = 267.55, p-value = 0.5102
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.01368469  0.02746733
## sample estimates:
##    mean of x    mean of y 
## 0.0078829262 0.0009916072

Interpretation

Neither contrast reaches significance; generalization is symmetric in our data, though the first pair shows a non‑significant trend (p = .11) toward larger benefits in the multi‑excl‑single‑single condition.
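
For scale, the mean adaptation benefits printed in the t‑test output above translate into differences well under 2 percentage points:

```r
# Mean adaptation benefits copied from the Welch t-test output above
pair1 <- c(`single-multi-excl-single`       = 0.004250124,
           `multi-excl-single-single`       = 0.018922037)
pair2 <- c(`single-single-diff-same-variety` = 0.0078829262,
           `single-single-diff-diff-variety` = 0.0009916072)

# Differences expressed in percentage points
round(100 * c(diff(pair1), diff(pair2)), 2)  # 1.47 and -0.69
```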

Replication verdict – Claim 3: Not replicated. We do not reproduce the directional asymmetries reported by Bradlow et al.


4 Training‑Phase Intelligibility as Moderator

Original claim Talkers who are difficult during training produce smaller downstream gains (positive intelligibility → adaptation correlation).

train_intel <- df_main %>%
  filter(phase == "Training") %>%
  group_by(speaker_id) %>%
  summarise(train_acc = mean(accuracy))

test_general <- df_main %>%
  filter(phase == "Testing") %>%
  group_by(speaker_id) %>%
  summarise(test_acc = mean(accuracy))

intel_adapt <- left_join(train_intel, test_general, by = "speaker_id") %>%
  mutate(adapt_gain = test_acc - train_acc)

plot(intel_adapt$train_acc, intel_adapt$adapt_gain,
     xlab = "Training‑Phase Intelligibility",
     ylab = "Generalization Gain",
     pch  = 19)
abline(lm(adapt_gain ~ train_acc, intel_adapt), col = "red")

cor.test(~ train_acc + adapt_gain, data = intel_adapt)
## 
##  Pearson's product-moment correlation
## 
## data:  train_acc and adapt_gain
## t = -0.50562, df = 13, p-value = 0.6216
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.6078923  0.4019850
## sample estimates:
##        cor 
## -0.1388753

Interpretation

Correlation r = –0.14, p = .62 → no reliable relationship. Note, too, that because train_acc enters adapt_gain with a negative sign, a mildly negative correlation is expected on purely arithmetic grounds; even allowing for that, the estimate is near zero.
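
With only 15 talkers (df = 13), the estimate is necessarily imprecise; the wide confidence interval in the output above can be reproduced with the standard Fisher z‑transform:

```r
# Reconstruct cor.test's confidence interval via the Fisher z-transform,
# using the correlation and sample size reported above
r <- -0.1388753
n <- 15
z  <- atanh(r)                # Fisher z-transform of r
se <- 1 / sqrt(n - 3)         # standard error of z
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)
round(ci, 3)                  # -0.608  0.402, matching the printed interval
```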

Replication verdict – Claim 4: Not replicated. We find no evidence that lower training intelligibility predicts weaker generalization.


Replication Verdicts at a Glance

  Bradlow et al. finding                   Replication outcome                                    Direction
  1 Low‑var suffices; high‑var not magic   Partial – high‑var > low‑var; low‑var not sufficient   diverges
  2 Talker‑specific edge inconsistent      Partial – edge only over Same‑Variety                  converges
  3 Generalization asymmetries             No – effects symmetric                                 diverges
  4 Intelligibility moderates learning     No – r ≈ 0                                             diverges

In sum, only the selective talker‑specific benefit (Finding 2) lines up neatly with Bradlow et al. (2023); the other three patterns either reverse or disappear in this web‑based, L2‑ARCTIC replication, underscoring the need to map the boundary conditions of perceptual adaptation to L2 speech.